pkg/docker: Add event-based Docker container info caching by florianl · Pull Request #1990 · open-telemetry/opentelemetry-ebpf-instrumentation

florianl · 2026-05-04T09:45:55Z

Summary

Implement in-process caching of Docker container metadata in ContainerStore with event-driven invalidation. When containers die or are destroyed, Docker events trigger immediate cache eviction, eliminating stale metadata and reducing repeated API calls.

This reduces CPU consumption in span enrichment pipeline by avoiding redundant Docker API calls. Flamegraph showed ~13.4% CPU in dockerEnricher; caching eliminates most calls for containers that host multiple processes.

Validation

I have read and followed the contributing guidelines
If this enhances / fixes / changes a core feature, I have updated the features documentation and support matrix as needed.

codecov · 2026-05-04T09:53:48Z

Codecov Report

❌ Patch coverage is 63.39869% with 56 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.24%. Comparing base (3929e7b) to head (600f825).
⚠️ Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
pkg/docker/docker_api_client.go	62.93%	47 Missing and 6 partials ⚠️
pkg/transform/transform_docker.go	66.66%	2 Missing ⚠️
pkg/appolly/discover/docker_decorator.go	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1990      +/-   ##
==========================================
+ Coverage   69.23%   69.24%   +0.01%     
==========================================
  Files         320      320              
  Lines       42323    42460     +137     
==========================================
+ Hits        29301    29401     +100     
- Misses      11325    11349      +24     
- Partials     1697     1710      +13

Flag	Coverage Δ
integration-test	`51.03% <19.04%> (+0.16%)`	⬆️
integration-test-arm	`28.16% <16.99%> (-0.22%)`	⬇️
integration-test-vm-5.15-lts	`?`
integration-test-vm-6.18-lts	`?`
k8s-integration-test	`38.05% <2.38%> (+0.13%)`	⬆️
oats-test	`35.95% <19.04%> (-0.03%)`	⬇️
unittests	`62.20% <70.63%> (+0.17%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

github-actions · 2026-05-04T10:11:04Z

CI Supervisor

Workflow	Job	Last state	Re-running?	Attempt
PR OATS test	kafka	failure	Yes	1/2
PR OATS test	ai	failure	Yes	1/2
PR OATS test	amqp	failure	Yes	1/2
PR OATS test	redis	failure	Yes	1/2
PR OATS test	memcached	failure	Yes	1/2
PR OATS test	http	failure	Yes	1/2
PR OATS test	sql	failure	Yes	1/2
PR OATS test	nats	failure	Yes	1/2
PR OATS test	mongo	failure	Yes	1/2
Pull request checks	Generate and checks	failure	Yes	1/2
Pull request integration tests ARM	shard-0 (1 tests)	failure	Yes	1/2
Pull request integration tests ARM	shard-1 (1 tests)	failure	Yes	1/2
Pull request integration tests ARM	shard-2 (1 tests)	failure	Yes	1/2
Pull request integration tests ARM	shard-3 (1 tests)	failure	Yes	1/2
Pull request K8s integration tests	informer_cache	failure	Yes	1/2
Pull request K8s integration tests	netolly	failure	Yes	1/2
Pull request K8s integration tests	daemonset_multi_node	failure	Yes	1/2
Pull request K8s integration tests	otel	failure	Yes	1/2
Pull request K8s integration tests	restrict_local_node	failure	Yes	1/2
Pull request K8s integration tests	disable_informers	failure	Yes	1/2
Pull request K8s integration tests	netolly_multizone	failure	Yes	1/2
Pull request K8s integration tests	netolly_promexport	failure	Yes	1/2
Pull request K8s integration tests	sharedpidns	failure	Yes	1/2
Pull request K8s integration tests	daemonset_python	failure	Yes	1/2
Pull request K8s integration tests	netolly_multizone_prom	failure	Yes	1/2
Pull request K8s integration tests	owners	failure	Yes	1/2
Pull request K8s integration tests	netolly_tc_promexport	failure	Yes	1/2
Pull request K8s integration tests	netolly_dropexternal	failure	Yes	1/2
Pull request K8s integration tests	daemonset	failure	Yes	1/2
Pull request K8s integration tests	prom	failure	Yes	1/2
Test Docker build	build (Dockerfile, linux/arm64, ubuntu-24.04-arm)	failure	No	2/2
Pull request integration tests	shard-6 (11 tests)	failure	No	2/2

Implement in-process caching of Docker container metadata in ContainerStore with event-driven invalidation. When containers die or are destroyed, Docker events trigger immediate cache eviction, eliminating stale metadata and reducing repeated API calls. This reduces CPU consumption in span enrichment pipeline by avoiding redundant Docker API calls. Flamegraph showed ~13.4% CPU in dockerEnricher; caching eliminates most calls for containers that host multiple processes.

grcevski

LGTM! Pretty cool! I'll let @mariomac review this since I'm only partially familiar with this code, but looks great from my side.

MrAlias

The caching direction makes sense, but I found one correctness issue in the invalidation path that looks blocking.

ContainerStore stores cache entries in byID using the shortened 12-character container ID derived from inspectInfo.ID, but the Docker event stream delivers the full container ID in msg.Actor.ID. In the current code that means the invalidation lookup never finds the cached entry, so die / destroy events do not actually evict stale metadata.

I think the fix should be to key invalidation with the full container ID internally, and only keep the 12-character form for exported metadata if that abbreviated form is still desired.

mariomac · 2026-05-08T08:28:31Z

Nice! Thank you!

In principle LGTM! But I have a question: are the performance tests taken in production/dev scenarios, or are they micro-benchmarks ran with go test -bench?

I would expect that a cache at this level is not needed because the docker_decorator.go file already has a containerByPID cache that is populated when the container data is found, and removed when an EventDeleted event is received. Also keeping the cache there removes the need for any mutex.

Signed-off-by: Florian Lehner <florian.lehner@elastic.co>

florianl · 2026-05-12T13:05:52Z

But I have a question: are the performance tests taken in production/dev scenarios, or are they micro-benchmarks ran with go test -bench?

I noticed this issue, when experimenting with OBI and ebpf-profiler in the OTel demo using something like this config:

receivers:
  profiling:
    obi_process_ctx: true
    samples_per_second: 73
    off_cpu_threshold: 0.1
  obi:
    open_port: "1-65535"

So it's not a micro-benchmark but more a learning from a demo environment.

MrAlias

One follow-up correctness concern on the shared Docker metadata cache: the new store-level PID cache can outlive the process lifecycle unless it also gets explicit PID-scoped eviction. The container-event invalidation path makes sense, but I think the store still needs a helper for process-deletion paths so a reused PID cannot pick up stale metadata from a previous container. After that, the existing lifecycle handlers in pkg/appolly/discover/docker_decorator.go on EventDeleted and pkg/transform/transform_docker.go on ProcessEventTerminated should call it so the shared cache stays aligned with PID lifetime.

Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com>

Signed-off-by: Florian Lehner <florian.lehner@elastic.co>

MrAlias

Two correctness issues still look unresolved on the current head, and I left one test suggestion to make the remaining PID-lifecycle bug reproducible.

Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com>

Signed-off-by: Florian Lehner <florian.lehner@elastic.co>

mariomac

LGTM despite I'd still remove the ID string field to ensure data consistency, as it's just a substring of FullID

mariomac · 2026-06-03T08:24:07Z

@@ -167,3 +206,101 @@ func ContainerMetadata[T ~string](dst map[T]string, ci *ContainerMeta, stringer
 	out[stringer(attr.ContainerID)] = ci.ID


for data consistency, I'd remove ID field and replace it here by attr.ContainerId[:abbreviationLength]

Stale

MrAlias

One remaining non-blocking cache-efficiency concern inline.

Signed-off-by: Florian Lehner <florian.lehner@elastic.co>

MrAlias

Two follow-up cache lifecycle comments inline. One is an adjacent stale-cache concern in the span path; the other is a current race in the new fullID cache write path.

Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com>

Signed-off-by: Florian Lehner <florian.lehner@elastic.co>

MrAlias

Sorry for the review churn here. I should have caught these cache lifecycle issues in earlier passes, and I appreciate your patience as we get this tightened up. I found two remaining races that can leave stale Docker metadata in the shared store, plus one lower-severity watcher enablement issue.

MrAlias · 2026-06-05T16:08:33Z

 	)

 	ctxInfo.DockerMetadata = docker.NewStore()
+	ctxInfo.DockerMetadata.Start(ctx)


Starting the Docker watcher here makes it run for every OBI process as soon as common context is built. In Kubernetes mode, or on hosts where Docker metadata is not otherwise used, watchContainerEvents() still retries Docker initialization once per second even though the Docker decorators later bypass themselves.

I think this should be started lazily only after Docker metadata is actually selected/enabled, or guarded by the same Docker-vs-Kubernetes enablement decision used by the decorators.

Added K8sInformer.IsKubeEnabled(), like it is used in other places. Hope this is fine.

Signed-off-by: Florian Lehner <florian.lehner@elastic.co>

florianl · 2026-06-15T14:09:07Z

It seems like, the failing CI tests are related to #2304.

grcevski · 2026-06-17T12:36:09Z

Hi @florianl, do you mind syncing with main, the test failures have been resolved.

florianl requested a review from a team as a code owner May 4, 2026 09:45

florianl marked this pull request as draft May 4, 2026 10:58

florianl force-pushed the docker-metadata-cache branch from 2195482 to 6ae924c Compare May 4, 2026 11:33

florianl force-pushed the docker-metadata-cache branch from 6ae924c to 398570b Compare May 4, 2026 11:34

florianl marked this pull request as ready for review May 4, 2026 11:58

grcevski approved these changes May 4, 2026

View reviewed changes

MrAlias previously requested changes May 4, 2026

View reviewed changes

Comment thread pkg/docker/docker_api_client.go Outdated

Comment thread pkg/docker/docker_api_client_test.go Outdated

florianl added 3 commits May 12, 2026 14:19

Merge branch 'main' into docker-metadata-cache

d8e6b4b

use full ID for ContainerMeta.ID

31d6ae9

Signed-off-by: Florian Lehner <florian.lehner@elastic.co>

don't drop comment

d516805

Signed-off-by: Florian Lehner <florian.lehner@elastic.co>

MrAlias reviewed May 12, 2026

View reviewed changes

Comment thread pkg/docker/docker_api_client.go Outdated

Comment thread pkg/docker/docker_api_client_test.go Outdated

florianl and others added 3 commits May 13, 2026 13:18

Apply suggestions from code review

3cbd10b

Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com>

use short ID in metadata

b243ba4

Signed-off-by: Florian Lehner <florian.lehner@elastic.co>

expose and use InvalidatePID

0ab215b

Signed-off-by: Florian Lehner <florian.lehner@elastic.co>

MrAlias reviewed May 13, 2026

View reviewed changes

Comment thread pkg/docker/docker_api_client.go Outdated

Comment thread pkg/docker/docker_api_client.go Outdated

Comment thread pkg/docker/docker_api_client.go

Comment thread pkg/docker/docker_api_client_test.go Outdated

florianl and others added 4 commits May 18, 2026 16:55

Apply suggestions from code review

5f4e4d5

Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com>

Merge branch 'main' into docker-metadata-cache

105df71

merge pidToID into byPID

99cffdd

Signed-off-by: Florian Lehner <florian.lehner@elastic.co>

refactor around abbreviationLength

2e4dbef

Signed-off-by: Florian Lehner <florian.lehner@elastic.co>

florianl force-pushed the docker-metadata-cache branch from f70d6b7 to 2e4dbef Compare May 19, 2026 08:26

florianl added 4 commits May 19, 2026 11:05

fix linting

aaad9dd

Signed-off-by: Florian Lehner <florian.lehner@elastic.co>

Merge branch 'main' into docker-metadata-cache

56db551

introduce type ContainerID

5450ba4

Signed-off-by: Florian Lehner <florian.lehner@elastic.co>

fix tests

ff8a4c1

Signed-off-by: Florian Lehner <florian.lehner@elastic.co>

florianl requested a review from mariomac June 2, 2026 08:44

mariomac reviewed Jun 3, 2026

View reviewed changes

MrAlias reviewed Jun 3, 2026

View reviewed changes

Comment thread pkg/docker/docker_api_client.go

extend byID to byContainerID and include metadata

6647265

Signed-off-by: Florian Lehner <florian.lehner@elastic.co>

MrAlias reviewed Jun 4, 2026

View reviewed changes

Comment thread pkg/transform/transform_docker.go

Comment thread pkg/docker/docker_api_client.go

florianl and others added 2 commits June 5, 2026 15:07

Update pkg/docker/docker_api_client.go

61b0c40

Co-authored-by: Tyler Yahn <MrAlias@users.noreply.github.com>

drop custom docker info cache

d00ddfa

Signed-off-by: Florian Lehner <florian.lehner@elastic.co>

MrAlias reviewed Jun 5, 2026

View reviewed changes

florianl added 2 commits June 15, 2026 14:03

Merge branch 'main' into docker-metadata-cache

dd98be7

apply feedback

0f7865a

Signed-off-by: Florian Lehner <florian.lehner@elastic.co>

Merge branch 'main' into docker-metadata-cache

600f825

		@@ -167,3 +206,101 @@ func ContainerMetadata[T ~string](dst map[T]string, ci *ContainerMeta, stringer
		out[stringer(attr.ContainerID)] = ci.ID

Conversation

florianl commented May 4, 2026

Summary

Validation

Uh oh!

codecov Bot commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

github-actions Bot commented May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

CI Supervisor

Uh oh!

grcevski left a comment

Choose a reason for hiding this comment

Uh oh!

MrAlias left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

mariomac commented May 8, 2026

Uh oh!

florianl commented May 12, 2026

Uh oh!

MrAlias left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

MrAlias left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

mariomac left a comment

Choose a reason for hiding this comment

Uh oh!

mariomac Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

MrAlias left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

MrAlias left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

MrAlias left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

MrAlias Jun 5, 2026

Choose a reason for hiding this comment

Uh oh!

florianl Jun 15, 2026

Choose a reason for hiding this comment

Uh oh!

florianl commented Jun 15, 2026

Uh oh!

grcevski commented Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

codecov Bot commented May 4, 2026 •

edited

Loading

github-actions Bot commented May 4, 2026 •

edited

Loading